This paper presents the data wrangling and exploratory data analysis of several datasets related to the WeRateDogs @dog_rates Twitter account. This account rates images of dogs and adds a humorous comment about each dog.
The work starts from a primary file, sourced from Twitter and provided by Udacity; however, this file does not contain all the information from the tweets. A second file corresponds to the output of a machine learning model, an image recognition neural network, with the predicted breed of each dog. Finally, a third file is obtained directly from the Twitter API, downloaded using the ID of each tweet in the primary file.
This analysis will seek to answer the following questions.
# Importing libraries to be used
import pandas as pd
import numpy as np
import tweepy
import requests
import os
import json
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
# Importing an additional library to time the data gathering from Twitter
# (tweepy and json are already imported above)
from timeit import default_timer as timer
# Variables with access keys to the Twitter API
# (redacted for security; replace with your own credentials)
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
# Generating objects from classes needed to connect to Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
# Reading archive provided by Udacity with Tweets.
df_tw = pd.read_csv('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv')
df_tw.shape
(2356, 17)
df_tw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2356 non-null   int64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object
 4   source                      2356 non-null   object
 5   text                        2356 non-null   object
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object
 9   expanded_urls               2297 non-null   object
 10  rating_numerator            2356 non-null   int64
 11  rating_denominator          2356 non-null   int64
 12  name                        2356 non-null   object
 13  doggo                       2356 non-null   object
 14  floofer                     2356 non-null   object
 15  pupper                      2356 non-null   object
 16  puppo                       2356 non-null   object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
sum(df_tw.duplicated())
0
# Downloading the file provided by Udacity with the breed predictions, using the requests library
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)
# Reading the file downloaded
df_pred = pd.read_csv('image-predictions.tsv', delimiter = '\t')
df_pred.shape
(2075, 12)
df_pred.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   tweet_id  2075 non-null   int64
 1   jpg_url   2075 non-null   object
 2   img_num   2075 non-null   int64
 3   p1        2075 non-null   object
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool
 6   p2        2075 non-null   object
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool
 9   p3        2075 non-null   object
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
df_pred.head()
|   | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
sum(df_pred.duplicated())
0
Next, the Twitter API is used to retrieve detailed information about the tweets we have: the ID of each tweet in the primary file is used to query its details. The data obtained comes in JSON format and is stored, one tweet per line, in the tweet_json.txt file.
# The number of tweets to consult
tweet_ids = df_tw.tweet_id.values
len(tweet_ids)
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)
print(end - start)
# Loading text file in JSON format into a Dataframe.
df_tw_details = pd.read_json('tweet_json.txt', orient='records', lines=True)
# The dataframe is loaded into a JSON structure and then normalized, to flatten the nested sub-levels of the JSON.
json_struct = json.loads(df_tw_details.to_json(orient="records"))
df_flat = pd.json_normalize(json_struct)
df_flat.shape
(2331, 326)
df_flat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Columns: 326 entries, created_at to place.bounding_box.coordinates
dtypes: bool(18), float64(74), int64(11), object(223)
memory usage: 5.5+ MB
df_flat.head()
|   | created_at | id | id_str | full_text | truncated | display_text_range | source | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | ... | place.id | place.url | place.place_type | place.name | place.full_name | place.country_code | place.country | place.contained_within | place.bounding_box.type | place.bounding_box.coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1501604636000 | 892420643555336193 | 892420643555336192 | This is Phineas. He's a mystical boy. Only eve... | False | [0, 85] | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1501546647000 | 892177421306343426 | 892177421306343424 | This is Tilly. She's just checking pup on you.... | False | [0, 138] | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1501460283000 | 891815181378084864 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | [0, 121] | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1501430331000 | 891689557279858688 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | [0, 79] | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1501344024000 | 891327558926688256 | 891327558926688256 | This is Franklin. He would like you to stop ca... | False | [0, 138] | <a href="http://twitter.com/download/iphone" r... | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 326 columns
Because the flattened data source contains 326 columns and will later be joined to other dataframes, the dataframe is "slimmed down": only the columns useful for the purposes of the analysis are kept.
df_twAPI = df_flat[['id','created_at','is_quote_status','retweet_count','favorite_count']].copy()
df_tw.columns
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
dtype='object')
# Joining the datasets, applying a Left Join with the file obtained from the Twitter API (JSON) on the left side,
# in order to keep only tweets that still exist.
df = pd.merge(df_twAPI, df_tw, left_on='id', right_on='tweet_id', how='left')
from IPython.display import display
pd.options.display.max_columns = None
df[df['id'] == 878404777348136964]
|   | id | created_at | is_quote_status | retweet_count | favorite_count | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72 | 878404777348136964 | 1498262993000 | False | 1119 | 0 | 878404777348136964 | NaN | NaN | 2017-06-24 00:09:53 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @dog_rates: Meet Shadow. In an attempt to r... | 8.782815e+17 | 4.196984e+09 | 2017-06-23 16:00:04 +0000 | https://www.gofundme.com/3yd6y1c,https://twitt... | 13 | 10 | Shadow | None | None | None | None |
By leaving the file from the API on the left side of the merge, the resulting dataframe has twenty-five fewer tweets. This may seem paradoxical, but the missing tweets no longer exist on Twitter, most likely deleted by their authors, so leaving them out of the analysis was considered beneficial in order to obtain up-to-date results.
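One way to check exactly which archive tweets the left join drops is pandas' merge indicator. A minimal sketch with toy data (the column names mirror the real dataframes; the values are made up):

```python
import pandas as pd

# Toy stand-ins: the API results (left) and the Udacity archive (right)
df_api = pd.DataFrame({'id': [1, 2, 3]})
df_archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4],
                           'text': ['a', 'b', 'c', 'd']})

# An outer merge with indicator=True labels each row by its origin
check = pd.merge(df_api, df_archive, left_on='id', right_on='tweet_id',
                 how='outer', indicator=True)

# Rows present only in the archive correspond to tweets deleted on Twitter
deleted = check[check['_merge'] == 'right_only']['tweet_id'].tolist()
print(deleted)  # [4]
```

Applied to the real dataframes, the `right_only` rows would be the twenty-five deleted tweets mentioned above.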
Next, the dataframes are merged, again applying a Left Join, to add the breed predictions to the data coming from Twitter. Note that only the breed the neural network ranked as most likely (p1) has been retained.
# Joining dataframes applying Left Join, leaving on the left the data with the detail of the tweets.
df_merged = pd.merge(df, df_pred[['tweet_id', 'jpg_url', 'p1', 'p1_conf', 'p1_dog']], left_on='id', right_on='tweet_id', how='left')
df_merged.head()
|   | id | created_at | is_quote_status | retweet_count | favorite_count | tweet_id_x | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | tweet_id_y | jpg_url | p1 | p1_conf | p1_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | 1501604636000 | False | 7359 | 35017 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None | 8.924206e+17 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | orange | 0.097049 | False |
| 1 | 892177421306343426 | 1501546647000 | False | 5487 | 30334 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None | 8.921774e+17 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | Chihuahua | 0.323581 | True |
| 2 | 891815181378084864 | 1501460283000 | False | 3628 | 22804 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None | 8.918152e+17 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | Chihuahua | 0.716012 | True |
| 3 | 891689557279858688 | 1501430331000 | False | 7540 | 38304 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None | 8.916896e+17 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | paper_towel | 0.170278 | False |
| 4 | 891327558926688256 | 1501344024000 | False | 8126 | 36573 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None | 8.913276e+17 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | basset | 0.555712 | True |
# Saving the resulting dataset in a CSV file
df_merged.to_csv(path_or_buf='gathered.csv', index=False)
Once the data has been collected in a single dataframe, the process continues with its evaluation.
In the assess stage, both the quality and the tidiness of the data are evaluated.
From the quality point of view:
In terms of tidiness, the dataset is expected to be neat and tidy, complying with:
# Reading from the CSV file
df = pd.read_csv('gathered.csv')
# Reviewing the dataset
df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 27 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   id                          2331 non-null   int64
 1   created_at                  2331 non-null   int64
 2   is_quote_status             2331 non-null   bool
 3   retweet_count               2331 non-null   int64
 4   favorite_count              2331 non-null   int64
 5   tweet_id_x                  2331 non-null   int64
 6   in_reply_to_status_id       78 non-null     float64
 7   in_reply_to_user_id         78 non-null     float64
 8   timestamp                   2331 non-null   object
 9   source                      2331 non-null   object
 10  text                        2331 non-null   object
 11  retweeted_status_id         163 non-null    float64
 12  retweeted_status_user_id    163 non-null    float64
 13  retweeted_status_timestamp  163 non-null    object
 14  expanded_urls               2272 non-null   object
 15  rating_numerator            2331 non-null   int64
 16  rating_denominator          2331 non-null   int64
 17  name                        2331 non-null   object
 18  doggo                       2331 non-null   object
 19  floofer                     2331 non-null   object
 20  pupper                      2331 non-null   object
 21  puppo                       2331 non-null   object
 22  tweet_id_y                  2059 non-null   float64
 23  jpg_url                     2059 non-null   object
 24  p1                          2059 non-null   object
 25  p1_conf                     2059 non-null   float64
 26  p1_dog                      2059 non-null   object
dtypes: bool(1), float64(6), int64(7), object(13)
memory usage: 475.9+ KB
df.head()
|   | id | created_at | is_quote_status | retweet_count | favorite_count | tweet_id_x | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | tweet_id_y | jpg_url | p1 | p1_conf | p1_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | 1501604636000 | False | 7359 | 35017 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None | 8.924206e+17 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | orange | 0.097049 | False |
| 1 | 892177421306343426 | 1501546647000 | False | 5487 | 30334 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None | 8.921774e+17 | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | Chihuahua | 0.323581 | True |
| 2 | 891815181378084864 | 1501460283000 | False | 3628 | 22804 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None | 8.918152e+17 | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | Chihuahua | 0.716012 | True |
| 3 | 891689557279858688 | 1501430331000 | False | 7540 | 38304 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None | 8.916896e+17 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | paper_towel | 0.170278 | False |
| 4 | 891327558926688256 | 1501344024000 | False | 8126 | 36573 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | None | None | None | None | 8.913276e+17 | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | basset | 0.555712 | True |
# Visual review of numerical variables
df.hist(figsize=(8, 8));
df.describe()
|   | id | created_at | retweet_count | favorite_count | tweet_id_x | in_reply_to_status_id | in_reply_to_user_id | retweeted_status_id | retweeted_status_user_id | rating_numerator | rating_denominator | tweet_id_y | p1_conf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.331000e+03 | 2.331000e+03 | 2331.000000 | 2331.000000 | 2.331000e+03 | 7.800000e+01 | 7.800000e+01 | 1.630000e+02 | 1.630000e+02 | 2331.000000 | 2331.000000 | 2.059000e+03 | 2059.000000 |
| mean | 7.419079e+17 | 1.465720e+12 | 2589.222651 | 7313.884170 | 7.419079e+17 | 7.455079e+17 | 2.014171e+16 | 7.690768e+17 | 1.378818e+16 | 13.134706 | 10.459888 | 7.378345e+17 | 0.594176 |
| std | 6.823170e+16 | 1.626770e+10 | 4379.857136 | 11360.046929 | 6.823170e+16 | 7.582492e+16 | 1.252797e+17 | 6.168036e+16 | 1.010910e+17 | 46.121847 | 6.781159 | 6.758411e+16 | 0.271161 |
| min | 6.660209e+17 | 1.447627e+12 | 1.000000 | 0.000000 | 6.660209e+17 | 6.658147e+17 | 1.185634e+07 | 6.661041e+17 | 7.832140e+05 | 0.000000 | 0.000000 | 6.660209e+17 | 0.044333 |
| 25% | 6.782670e+17 | 1.450546e+12 | 525.000000 | 1268.000000 | 6.782670e+17 | 6.757419e+17 | 3.086374e+08 | 7.118465e+17 | 4.196984e+09 | 10.000000 | 10.000000 | 6.762505e+17 | 0.363272 |
| 50% | 7.182469e+17 | 1.460078e+12 | 1207.000000 | 3174.000000 | 7.182469e+17 | 7.038708e+17 | 4.196984e+09 | 7.798343e+17 | 4.196984e+09 | 11.000000 | 10.000000 | 7.113067e+17 | 0.587764 |
| 75% | 7.986692e+17 | 1.479253e+12 | 2997.500000 | 8942.500000 | 7.986692e+17 | 8.257804e+17 | 4.196984e+09 | 8.190056e+17 | 4.196984e+09 | 12.000000 | 10.000000 | 7.928986e+17 | 0.843855 |
| max | 8.924206e+17 | 1.501605e+12 | 74311.000000 | 150565.000000 | 8.924206e+17 | 8.862664e+17 | 8.405479e+17 | 8.860537e+17 | 7.874618e+17 | 1776.000000 | 170.000000 | 8.924206e+17 | 1.000000 |
df[df['p1_dog'] == False]
|   | id | created_at | is_quote_status | retweet_count | favorite_count | tweet_id_x | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo | tweet_id_y | jpg_url | p1 | p1_conf | p1_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | 1501604636000 | False | 7359 | 35017 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None | 8.924206e+17 | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | orange | 0.097049 | False |
| 3 | 891689557279858688 | 1501430331000 | False | 7540 | 38304 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | None | None | None | None | 8.916896e+17 | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | paper_towel | 0.170278 | False |
| 21 | 887517139158093824 | 1500435549000 | False | 10278 | 42168 | 887517139158093824 | NaN | NaN | 2017-07-19 03:39:09 +0000 | <a href="http://twitter.com/download/iphone" r... | I've yet to rate a Venezuelan Hover Wiener. Th... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/887517139... | 14 | 10 | such | None | None | None | None | 8.875171e+17 | https://pbs.twimg.com/ext_tw_video_thumb/88751... | limousine | 0.130432 | False |
| 27 | 886680336477933568 | 1500236040000 | False | 3915 | 20462 | 886680336477933568 | NaN | NaN | 2017-07-16 20:14:00 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Derek. He's late for a dog meeting. 13... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/886680336... | 13 | 10 | Derek | None | None | None | None | 8.866803e+17 | https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg | convertible | 0.738995 | False |
| 51 | 882045870035918850 | 1499131097000 | False | 4247 | 26658 | 882045870035918850 | NaN | NaN | 2017-07-04 01:18:17 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Koko. Her owner, inspired by Barney, r... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/882045870... | 13 | 10 | Koko | None | None | None | None | 8.820459e+17 | https://pbs.twimg.com/media/DD2oCl2WAAEI_4a.jpg | web_site | 0.949591 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2309 | 666293911632134144 | 1447691822000 | False | 306 | 447 | 666293911632134144 | NaN | NaN | 2015-11-16 16:37:02 +0000 | <a href="http://twitter.com/download/iphone" r... | This is a funny dog. Weird toes. Won't come do... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666293911... | 3 | 10 | a | None | None | None | None | 6.662939e+17 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | three-toed_sloth | 0.914671 | False |
| 2312 | 666268910803644416 | 1447685861000 | False | 32 | 93 | 666268910803644416 | NaN | NaN | 2015-11-16 14:57:41 +0000 | <a href="http://twitter.com/download/iphone" r... | Very concerned about fellow dog trapped in com... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666268910... | 10 | 10 | None | None | None | None | None | 6.662689e+17 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | desktop_computer | 0.086502 | False |
| 2313 | 666104133288665088 | 1447646575000 | False | 5732 | 13157 | 666104133288665088 | NaN | NaN | 2015-11-16 04:02:55 +0000 | <a href="http://twitter.com/download/iphone" r... | Not familiar with this breed. No tail (weird).... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666104133... | 1 | 10 | None | None | None | None | None | 6.661041e+17 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | hen | 0.965932 | False |
| 2322 | 666057090499244032 | 1447635359000 | False | 118 | 263 | 666057090499244032 | NaN | NaN | 2015-11-16 00:55:59 +0000 | <a href="http://twitter.com/download/iphone" r... | My oh my. This is a rare blond Canadian terrie... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666057090... | 9 | 10 | a | None | None | None | None | 6.660571e+17 | https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg | shopping_cart | 0.962465 | False |
| 2324 | 666051853826850816 | 1447634111000 | False | 740 | 1081 | 666051853826850816 | NaN | NaN | 2015-11-16 00:35:11 +0000 | <a href="http://twitter.com/download/iphone" r... | This is an odd dog. Hard on the outside but lo... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666051853... | 2 | 10 | an | None | None | None | None | 6.660519e+17 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | box_turtle | 0.933012 | False |
539 rows × 27 columns
df['rating_numerator'].value_counts()
12      552
11      464
10      458
13      339
9       157
8       102
7        55
14       51
5        37
6        32
3        19
4        17
1         9
2         9
0         2
15        2
75        2
420       2
182       1
204       1
143       1
121       1
99        1
20        1
45        1
27        1
17        1
24        1
26        1
44        1
50        1
60        1
80        1
84        1
88        1
1776      1
960       1
666       1
144       1
165       1
Name: rating_numerator, dtype: int64
df['rating_numerator'].plot.box()
<AxesSubplot:>
df['rating_denominator'].value_counts()
10     2309
50        3
11        2
20        2
80        2
0         1
120       1
7         1
170       1
150       1
130       1
90        1
110       1
2         1
70        1
40        1
16        1
15        1
Name: rating_denominator, dtype: int64
df['rating_denominator'].plot.box()
<AxesSubplot:>
After visually and programmatically reviewing the dataset, it can be affirmed that:
df_clean = df.copy()
# Counting the number of null values
df_clean.expanded_urls.isna().sum()
59
# Completing null values with the word 'Uninformed'
df_clean['expanded_urls'].fillna('Uninformed', inplace=True)
df_clean.expanded_urls.isna().sum()
0
df_clean.columns
Index(['id', 'created_at', 'is_quote_status', 'retweet_count',
'favorite_count', 'tweet_id_x', 'in_reply_to_status_id',
'in_reply_to_user_id', 'timestamp', 'source', 'text',
'retweeted_status_id', 'retweeted_status_user_id',
'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
'tweet_id_y', 'jpg_url', 'p1', 'p1_conf', 'p1_dog'],
dtype='object')
df_clean = df_clean.rename(columns={'p1': 'breed', 'p1_conf': '%_reliab', 'p1_dog': 'is_breed'})
df_clean.columns
Index(['id', 'created_at', 'is_quote_status', 'retweet_count',
'favorite_count', 'tweet_id_x', 'in_reply_to_status_id',
'in_reply_to_user_id', 'timestamp', 'source', 'text',
'retweeted_status_id', 'retweeted_status_user_id',
'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
'tweet_id_y', 'jpg_url', 'breed', '%_reliab', 'is_breed'],
dtype='object')
df_clean[['source']].value_counts()
source
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2197
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>     10
dtype: int64
df_clean[['source']] = df_clean.source.str.split('>', expand = True)[1]
df_clean[['source']] = df_clean.source.str.split('<', expand = True)[0]
df_clean[['source']].value_counts()
source
Twitter for iPhone     2197
Vine - Make a Scene      91
Twitter Web Client       33
TweetDeck                10
dtype: int64
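The two-step split above works because the source field is an HTML anchor; the same extraction can be done in a single pass with a regular expression. A sketch on sample values (not the project's original code):

```python
import pandas as pd

# Sample 'source' values as they appear in the raw archive
sources = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# Capture the text between the '>' closing the opening tag and '</a>'
clean = sources.str.extract(r'>([^<]+)<', expand=False)
print(clean.tolist())  # ['Twitter for iPhone', 'Vine - Make a Scene']
```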
df_clean.shape
(2331, 27)
df_clean[['name','breed','is_breed']][df_clean['is_breed'] == False].head()
|   | name | breed | is_breed |
|---|---|---|---|
| 0 | Phineas | orange | False |
| 3 | Darla | paper_towel | False |
| 21 | such | limousine | False |
| 27 | Derek | convertible | False |
| 51 | Koko | web_site | False |
df_clean.loc[df_clean['is_breed'] == False, 'breed'] = 'Uninformed'
df_clean[['name','breed','is_breed']][df_clean['is_breed'] == False].head()
|   | name | breed | is_breed |
|---|---|---|---|
| 0 | Phineas | Uninformed | False |
| 3 | Darla | Uninformed | False |
| 21 | such | Uninformed | False |
| 27 | Derek | Uninformed | False |
| 51 | Koko | Uninformed | False |
The statement of this study asks not to consider retweets. After reviewing the dataset, it was observed that the column 'retweeted_status_id' contains a value only when the record corresponds to a retweet; therefore, the records valid according to the statement are those with a null value in this column.
df_clean['retweeted_status_id'].notna().sum()
163
# Get the indexes of rows where 'retweeted_status_id' is not NaN
indexNames = df_clean[df_clean['retweeted_status_id'].notna()].index
# Delete these rows from the dataframe
df_clean.drop(indexNames, inplace=True)
df_clean['retweeted_status_id'].notna().sum()
0
df_clean.shape
(2168, 27)
An exhaustive review of the cases with no breed-prediction information was carried out; in every case they correspond to tweets without images. According to the requirement presented for this project, these records are not considered in the analysis.
df_clean[['expanded_urls','jpg_url','breed']][df_clean['breed'].isna()]
|   | expanded_urls | jpg_url | breed |
|---|---|---|---|
| 29 | Uninformed | NaN | NaN |
| 34 | https://twitter.com/4bonds2carbon/status/88551... | NaN | NaN |
| 41 | https://twitter.com/kaijohnson_19/status/88396... | NaN | NaN |
| 54 | Uninformed | NaN | NaN |
| 63 | Uninformed | NaN | NaN |
| ... | ... | ... | ... |
| 2013 | Uninformed | NaN | NaN |
| 2124 | Uninformed | NaN | NaN |
| 2164 | Uninformed | NaN | NaN |
| 2187 | https://vine.co/v/ea0OwvPTx9l | NaN | NaN |
| 2273 | Uninformed | NaN | NaN |
181 rows × 3 columns
df_clean = df_clean[df_clean['breed'].notna()]
df_clean.shape
(1987, 27)
df_clean[['expanded_urls','jpg_url','breed']][df_clean['breed'].isna()].sum()
expanded_urls 0.0 jpg_url 0.0 breed 0.0 dtype: float64
df_clean['rating_numerator'].plot.box()
<AxesSubplot:>
df_clean['rating_numerator'].describe()
count    1987.000000
mean       12.280825
std        41.570741
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64
median = df_clean['rating_numerator'].median()
std = df_clean['rating_numerator'].std()
outliers = (df_clean['rating_numerator'] - median).abs() > (std * 2)
df_clean.loc[outliers, 'rating_numerator'] = np.nan
df_clean['rating_numerator'].fillna(median, inplace=True)
df_clean['rating_numerator'].describe()
count    1987.000000
mean       10.779064
std         4.304264
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        88.000000
Name: rating_numerator, dtype: float64
median = df_clean['rating_numerator'].median()
std = df_clean['rating_numerator'].std()
outliers = (df_clean['rating_numerator'] - median).abs() > (std * 2)
df_clean.loc[outliers, 'rating_numerator'] = np.nan
df_clean['rating_numerator'].fillna(median, inplace=True)
df_clean['rating_numerator'].plot.box()
<AxesSubplot:>
df_clean['rating_numerator'].describe()
count    1987.000000
mean       10.613488
std         2.030874
min         3.000000
25%        10.000000
50%        11.000000
75%        12.000000
max        14.000000
Name: rating_numerator, dtype: float64
df_clean['rating_denominator'].plot.box()
<AxesSubplot:>
df_clean['rating_denominator'].describe()
count 1987.000000 mean 10.533971 std 7.333532 min 2.000000 25% 10.000000 50% 10.000000 75% 10.000000 max 170.000000 Name: rating_denominator, dtype: float64
median = df_clean['rating_denominator'].median()
std = df_clean['rating_denominator'].std()
outliers = (df_clean['rating_denominator'] - median).abs() > (std * 2)
df_clean.loc[outliers, 'rating_denominator'] = np.nan
df_clean['rating_denominator'].fillna(median, inplace=True)
df_clean['rating_denominator'].describe()
count 1987.000000 mean 10.000503 std 0.296844 min 2.000000 25% 10.000000 50% 10.000000 75% 10.000000 max 20.000000 Name: rating_denominator, dtype: float64
median = df_clean['rating_denominator'].median()
std = df_clean['rating_denominator'].std()
outliers = (df_clean['rating_denominator'] - median).abs() > (std * 2)
df_clean.loc[outliers, 'rating_denominator'] = np.nan
df_clean['rating_denominator'].fillna(median, inplace=True)
df_clean['rating_denominator'].plot.box()
<AxesSubplot:>
df_clean['rating_denominator'].describe()
count 1987.0 mean 10.0 std 0.0 min 10.0 25% 10.0 50% 10.0 75% 10.0 max 10.0 Name: rating_denominator, dtype: float64
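The two identical passes applied to 'rating_numerator' and 'rating_denominator' can be captured in a single helper. This is an illustrative sketch, not code from the notebook; `replace_outliers_with_median` and the demo series are hypothetical names.

```python
import pandas as pd

def replace_outliers_with_median(series, n_std=2, passes=2):
    """Replace values farther than n_std sample standard deviations from
    the median with the median, repeating for the given number of passes.
    Mirrors the two manual passes applied to the rating columns."""
    s = series.astype(float).copy()
    for _ in range(passes):
        median = s.median()
        std = s.std()
        s.loc[(s - median).abs() > std * n_std] = median
    return s

# Hypothetical demo: 1776 (the famous outlier rating) collapses to the median
ratings = pd.Series([10, 11, 12, 13, 1776, 10, 11, 12])
cleaned = replace_outliers_with_median(ratings)
```

Iterating matters because the first pass's standard deviation is inflated by the extreme values themselves; a second pass with the recomputed, smaller deviation catches milder outliers, which is why each cleaning cell is run twice above.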
Due to successive merges, there are some duplicated columns, such as 'tweet_id_x' and 'tweet_id_y', which repeat the existing 'id' column. Likewise, 'created_at' and 'timestamp' hold the same information in different formats; 'timestamp' is kept, but under the more readable name 'created_at'.
There are also columns whose values are all null: 'retweeted_status_id', 'retweeted_status_user_id' and 'retweeted_status_timestamp'.
Finally, the columns 'in_reply_to_status_id' and 'in_reply_to_user_id' are also dropped; they contain only 23 non-null values and do not help answer any of the questions targeted by this analysis.
df_clean.drop(columns=['tweet_id_x','tweet_id_y', 'created_at', 'retweeted_status_id',
'retweeted_status_user_id','retweeted_status_timestamp',
'in_reply_to_status_id','in_reply_to_user_id'], inplace=True)
df_clean = df_clean.rename(columns={'timestamp': 'created_at'})
df_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1987 entries, 0 to 2330 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 1987 non-null int64 1 is_quote_status 1987 non-null bool 2 retweet_count 1987 non-null int64 3 favorite_count 1987 non-null int64 4 created_at 1987 non-null object 5 source 1987 non-null object 6 text 1987 non-null object 7 expanded_urls 1987 non-null object 8 rating_numerator 1987 non-null float64 9 rating_denominator 1987 non-null float64 10 name 1987 non-null object 11 doggo 1987 non-null object 12 floofer 1987 non-null object 13 pupper 1987 non-null object 14 puppo 1987 non-null object 15 jpg_url 1987 non-null object 16 breed 1987 non-null object 17 %_reliab 1987 non-null float64 18 is_breed 1987 non-null object dtypes: bool(1), float64(3), int64(3), object(12) memory usage: 296.9+ KB
df_clean.columns
Index(['id', 'is_quote_status', 'retweet_count', 'favorite_count',
'created_at', 'source', 'text', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
'jpg_url', 'breed', '%_reliab', 'is_breed'],
dtype='object')
df_clean.shape
(1987, 19)
""" Returns the value corresponding to the size of the dog, as reported in one of the four columns provided.
The aim is to leave it in a single column.
"""
def define_size(row):
if row['doggo'] == 'doggo' and row['floofer'] == 'None' and row['pupper'] == 'None' and row['puppo'] == 'None':
return 'doggo'
elif row['doggo'] == 'None' and row['floofer'] == 'floofer' and row['pupper'] == 'None' and row['puppo'] == 'None':
return 'floofer'
elif row['doggo'] == 'None' and row['floofer'] == 'None' and row['pupper'] == 'pupper' and row['puppo'] == 'None':
return 'pupper'
elif row['doggo'] == 'None' and row['floofer'] == 'None' and row['pupper'] == 'None' and row['puppo'] == 'puppo':
return 'puppo'
else:
return np.nan
# Apply define_size() to every row to build the new 'size' column
df_clean['size'] = df_clean.apply(define_size, axis=1)
df_clean = df_clean.drop(columns=['doggo','floofer','pupper','puppo'])
df_clean.columns
Index(['id', 'is_quote_status', 'retweet_count', 'favorite_count',
'created_at', 'source', 'text', 'expanded_urls', 'rating_numerator',
'rating_denominator', 'name', 'jpg_url', 'breed', '%_reliab',
'is_breed', 'size'],
dtype='object')
df_clean.shape
(1987, 16)
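The row-wise `apply` above is fine for ~2,000 rows, but the same collapse can also be done with vectorized pandas operations. The sketch below works on a hypothetical mini-frame (`demo` is not part of the dataset) and keeps a stage only when exactly one of the four columns is set, mirroring the behavior of define_size().

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mimicking the four stage columns
demo = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None',  'None', 'None', 'None'],
    'pupper':  ['None',  'pupper', 'None', 'pupper'],
    'puppo':   ['None',  'None', 'None', 'None'],
})

# Turn the 'None' placeholders into real NaN, then keep a value only
# when exactly one of the four columns is populated
stages = demo[['doggo', 'floofer', 'pupper', 'puppo']].replace('None', np.nan)
size = stages.bfill(axis=1).iloc[:, 0].where(stages.notna().sum(axis=1) == 1)
```

Rows with zero or multiple stages (like the last demo row, tagged both 'doggo' and 'pupper') come out as NaN, matching the `else` branch of define_size().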
df_clean[['url_1','url_2','url_3','url_4','url_5']] = df_clean.expanded_urls.str.split(',', expand = True)
df_clean.drop(columns=['expanded_urls'], inplace=True)
df_clean.columns
Index(['id', 'is_quote_status', 'retweet_count', 'favorite_count',
'created_at', 'source', 'text', 'rating_numerator',
'rating_denominator', 'name', 'jpg_url', 'breed', '%_reliab',
'is_breed', 'size', 'url_1', 'url_2', 'url_3', 'url_4', 'url_5'],
dtype='object')
df_clean.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1987 entries, 0 to 2330 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 1987 non-null int64 1 is_quote_status 1987 non-null bool 2 retweet_count 1987 non-null int64 3 favorite_count 1987 non-null int64 4 created_at 1987 non-null object 5 source 1987 non-null object 6 text 1987 non-null object 7 rating_numerator 1987 non-null float64 8 rating_denominator 1987 non-null float64 9 name 1987 non-null object 10 jpg_url 1987 non-null object 11 breed 1987 non-null object 12 %_reliab 1987 non-null float64 13 is_breed 1987 non-null object 14 size 294 non-null object 15 url_1 1987 non-null object 16 url_2 506 non-null object 17 url_3 226 non-null object 18 url_4 115 non-null object 19 url_5 8 non-null object dtypes: bool(1), float64(3), int64(3), object(13) memory usage: 312.4+ KB
df_clean.to_csv('twitter_archive_master.csv', index=False)
With a clean and archived dataset, it is possible to proceed with exploratory analysis of these data.
""" Function that draw a boxplot using plotly library, with all points graphed.
Only accept up to two columns, for two dimensions of the graph
IN:
- df: Pandas Dataframe with data to be used for graph
- columns: list with columns of the dataframe to be graphed
OUT:
Only in case of error, prints a message noticing
"""
def boxplot(df, columns, y_range):
if len(columns) == 1:
fig = go.Figure(px.box(df[columns[0]], y=columns[0]))
elif len(columns) == 2:
fig = go.Figure(px.box(df[[columns[0],columns[1]]], x=columns[0], y=columns[1]))
else:
print("Only one or two dimensions; no more, no less.")
return
fig.update_yaxes(range=y_range)
fig.show()
""" Function that draw a bubbles graph using plotly library.
Only accept two columns, for two dimensions of the graph
IN:
- df: Pandas Dataframe with data to be used for graph
- columns: list with columns of the dataframe to be graphed
"""
def bubbles(df, columns):
fig = px.scatter(df, x=columns[0], y=columns[1], size=columns[2],
color=columns[3], hover_name=columns[4])
fig.show()
""" Function that draw a scatterplot using plotly library.
Only accept two columns, for two dimensions of the graph
IN:
- df: Pandas Dataframe with data to be used for graph
- columns: list with columns of the dataframe to be graphed
OUT:
Only in case of error, prints a message noticing
"""
def scatterplot(df, columns):
if len(columns) == 2:
fig = go.Figure(px.scatter(df, x=columns[0], y=columns[1], opacity=0.3,
trendline='ols', trendline_color_override='darkblue'))
fig.show()
else:
print("Only two dimensions; no more, no less.")
return
df = pd.read_csv('twitter_archive_master.csv')
df.head()
| id | is_quote_status | retweet_count | favorite_count | created_at | source | text | rating_numerator | rating_denominator | name | jpg_url | breed | %_reliab | is_breed | size | url_1 | url_2 | url_3 | url_4 | url_5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | False | 7359 | 35017 | 2017-08-01 16:23:56 +0000 | Twitter for iPhone | This is Phineas. He's a mystical boy. Only eve... | 13.0 | 10.0 | Phineas | https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg | Uninformed | 0.097049 | False | NaN | https://twitter.com/dog_rates/status/892420643... | NaN | NaN | NaN | NaN |
| 1 | 892177421306343426 | False | 5487 | 30334 | 2017-08-01 00:17:27 +0000 | Twitter for iPhone | This is Tilly. She's just checking pup on you.... | 13.0 | 10.0 | Tilly | https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg | Chihuahua | 0.323581 | True | NaN | https://twitter.com/dog_rates/status/892177421... | NaN | NaN | NaN | NaN |
| 2 | 891815181378084864 | False | 3628 | 22804 | 2017-07-31 00:18:03 +0000 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin... | 12.0 | 10.0 | Archie | https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg | Chihuahua | 0.716012 | True | NaN | https://twitter.com/dog_rates/status/891815181... | NaN | NaN | NaN | NaN |
| 3 | 891689557279858688 | False | 7540 | 38304 | 2017-07-30 15:58:51 +0000 | Twitter for iPhone | This is Darla. She commenced a snooze mid meal... | 13.0 | 10.0 | Darla | https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg | Uninformed | 0.170278 | False | NaN | https://twitter.com/dog_rates/status/891689557... | NaN | NaN | NaN | NaN |
| 4 | 891327558926688256 | False | 8126 | 36573 | 2017-07-29 16:00:24 +0000 | Twitter for iPhone | This is Franklin. He would like you to stop ca... | 12.0 | 10.0 | Franklin | https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg | basset | 0.555712 | True | NaN | https://twitter.com/dog_rates/status/891327558... | https://twitter.com/dog_rates/status/891327558... | NaN | NaN | NaN |
# Generating a new dataset with the name of the breed and the count of it
breeds = df[['breed', 'id']].groupby(['breed']).count()
breeds.rename(columns = {'id':'count'}, inplace = True)
breeds.reset_index(inplace = True)
breeds = breeds[breeds['breed'] != 'Uninformed']
breeds.sort_values(by='count', ascending=False).head(30)
| breed | count | |
|---|---|---|
| 88 | golden_retriever | 139 |
| 35 | Labrador_retriever | 93 |
| 47 | Pembroke | 88 |
| 15 | Chihuahua | 79 |
| 100 | pug | 54 |
| 80 | chow | 41 |
| 53 | Samoyed | 39 |
| 48 | Pomeranian | 38 |
| 107 | toy_poodle | 37 |
| 94 | malamute | 29 |
| 82 | cocker_spaniel | 27 |
| 22 | French_bulldog | 25 |
| 14 | Chesapeake_Bay_retriever | 23 |
| 96 | miniature_pinscher | 22 |
| 58 | Siberian_husky | 20 |
| 23 | German_shepherd | 20 |
| 21 | Eskimo_dog | 18 |
| 56 | Shetland_sheepdog | 18 |
| 39 | Maltese_dog | 18 |
| 59 | Staffordshire_bullterrier | 18 |
| 13 | Cardigan | 17 |
| 71 | beagle | 17 |
| 50 | Rottweiler | 17 |
| 57 | Shih-Tzu | 17 |
| 36 | Lakeland_terrier | 16 |
| 33 | Italian_greyhound | 16 |
| 93 | kuvasz | 16 |
| 27 | Great_Pyrenees | 14 |
| 67 | West_Highland_white_terrier | 14 |
| 70 | basset | 13 |
# In order to generate readable information, most common breeds are selected
top_breeds = breeds[breeds['count'] >= 10]
px.pie(top_breeds, values='count', names='breed')
The most common breeds are the retrievers: Golden Retriever with 12.1% and Labrador Retriever with 8.08% are the most frequent; together they account for approximately 20%.
# Similar to breeds, here, most common names are selected.
names = df[['name', 'id']].groupby(['name']).count()
names.rename(columns = {'id':'sum'}, inplace = True)
names.reset_index(inplace = True)
names = names[names['name'] != 'None']
names = names[names['name'] != 'a']
names.sort_values(by='sum', ascending=False).head(30)
names = names[names['sum'] >= 5]
px.pie(names, values='sum', names='name')
Charlie, Cooper and Oliver are the most common dog names on WeRateDogs.
# Creating a new column for the rating
df['rating'] = df['rating_numerator'] / df['rating_denominator']
boxplot(pd.merge(top_breeds[['breed']], df[['breed', 'rating']], on='breed', how='left'),
        ['breed', 'rating'], [0.4, 1.5])
Eskimo_dog, Samoyed and Cardigan are the breeds best rated by the Twitter account @dog_rates.
# A new column is generated with the sum of retweets and likes.
df['likes_and_rt'] = df['retweet_count'] + df['favorite_count']
# A new dataframe is generated, with the count and the sum of reactions for every breed.
breeds_sum = pd.merge(breeds, df[['breed', 'likes_and_rt']].groupby('breed').sum(),
                      on='breed', how='left')
breeds_sum.head()
| breed | count | likes_and_rt | |
|---|---|---|---|
| 0 | Afghan_hound | 3 | 60773 |
| 1 | Airedale | 12 | 70022 |
| 2 | American_Staffordshire_terrier | 12 | 84982 |
| 3 | Appenzeller | 2 | 15392 |
| 4 | Australian_terrier | 2 | 24983 |
# Adding a new column with the median rating for every breed
breeds_sum['rating'] = pd.merge(breeds, df[['breed', 'rating']].groupby('breed').median(),
                                on='breed', how='left')['rating']
breeds_sum.head()
| breed | count | likes_and_rt | rating | |
|---|---|---|---|---|
| 0 | Afghan_hound | 3 | 60773 | 1.30 |
| 1 | Airedale | 12 | 70022 | 1.05 |
| 2 | American_Staffordshire_terrier | 12 | 84982 | 1.05 |
| 3 | Appenzeller | 2 | 15392 | 1.10 |
| 4 | Australian_terrier | 2 | 24983 | 1.15 |
boxplot(pd.merge(top_breeds[['breed']], df[['breed', 'likes_and_rt']], on='breed', how='left'),
        ['breed', 'likes_and_rt'], [0, 50000])
The breeds that receive the most reactions from Twitter users are: French Bulldog, Cardigan and Basset.
bubbles(breeds_sum,['rating','likes_and_rt','count','breed','breed'])
In this graph, each bubble represents one breed: its size indicates the number of records for that breed, the x-axis shows the rating given by WeRateDogs and the y-axis the reactions from Twitter users. From this it is possible to affirm that the Golden Retriever is the most common breed and the most valued by users; although WeRateDogs rates it very well, it is not at the top of their ratings. WeRateDogs gives better ratings to breeds such as the Saluki, the Afghan Hound and the Giant Schnauzer.
scatterplot(df,['rating','likes_and_rt'])
The graph shows no major relationship between the preferences of WeRateDogs and those of the users. Although the trendline slope is positive, it is shallow, and its $r^2$ value is low, so the line is not representative and does not indicate much of a relationship between the two.
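The low $r^2$ reported by the plotly OLS trendline can be double-checked outside plotly: for a simple linear fit, $r^2$ equals the squared Pearson correlation. The sketch below uses synthetic stand-in data for the two columns (`rating` and `likes_and_rt` here are simulated, not the real dataset), so the exact numbers are illustrative only.

```python
import numpy as np

# Synthetic stand-in for the two columns: a weak linear signal buried in noise
rng = np.random.default_rng(0)
rating = rng.normal(1.1, 0.2, 500)
likes_and_rt = 20000 + 3000 * rating + rng.normal(0, 15000, 500)

# Pearson correlation; for simple OLS, the fit's r^2 is just r squared
r = np.corrcoef(rating, likes_and_rt)[0, 1]
r_squared = r ** 2
slope, intercept = np.polyfit(rating, likes_and_rt, 1)
```

With noise this large relative to the signal, $r^2$ comes out close to zero, which is the same qualitative picture the scatterplot above gives for the real data.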
This work consisted mainly of the application of Data Wrangling techniques. Data Wrangling involves three main stages: Gather, Assess and Clean; all of them were applied in this study.
In addition to Data Wrangling, Exploratory Data Analysis was also carried out.
This study consisted of the analysis of the tweet archive of Twitter user @dog_rates, also known as WeRateDogs.
A total of three data sources were considered. The first is a file in CSV format, referenced in this paper as the 'primary' file and provided by Udacity; it contains incomplete information on the tweets to be analysed. The second file, in TSV format and also provided by Udacity, contains the breed of each dog published in the CSV file, identified by a machine-learning image-recognition model. The third data source was generated by the author from the Twitter API: the ID of each tweet was sent and its detail received in JSON format, which allowed the missing information in the primary file to be completed.
The following data quality problems were identified and corrected:
Tidiness issues were also identified and corrected.
As for the questions initially raised, they can now be answered: